Data visualization is all about putting data in a visual context like a map or a graph. It helps us to grasp knowledge more easily, as well as detect patterns, tendencies and exceptions in data clustering.
In this study I chose a Natural Language Processing technique called Latent Semantic Indexing to cluster COVID-19 (SARS-CoV-2) related policy propositions from the Brazilian Chamber of Deputies.
Latent Semantic Indexing (LSI) or Latent Semantic Analysis is a technique that enables us to observe relationships between terms in a document by assuming that terms that are similar in meaning are used in similar contexts.
How does it work
First you need to build a term-document matrix. It’s basically a matrix where every row represents a term, and every column represents a document. So let’s check some definitions.
document: the generalized idea of a word, sentence or text. It’s any piece of textual data I want to analyse. It can be a tweet, a YouTube comment, a Reddit post, a paragraph from a book, a paper abstract, etc.
term: every piece that builds up a document. Depending on what I’m looking for, it can be every single word, every syllable, any set of n consecutive words (also known as n-grams), or any combination of those.
Let’s say I have three documents:
“Victor likes apples and bananas.”
“My dog hates apples and bananas”
“My dog is an animal, but bananas are not”.
The transposed version (for a better fit on the page) of a term-document matrix for that set of documents would look like this:
| | victor | likes | apples | and | bananas | my | dog | hates | is | an | animal | but | are | not |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| document_0 | x | x | x | x | x | | | | | | | | | |
| document_1 | | | x | x | x | x | x | x | | | | | | |
| document_2 | | | | | x | x | x | | x | x | x | x | x | x |
In this example I only used single words as terms for simplicity’s sake, but in the practical example I also use 2-grams. In that case, I’d also add to the terms list “victor likes”, “likes apples”, “apples and”, “and bananas”, and so on. The actual content of the term-document matrix depends on how you choose to measure the presence of each term. It can be as simple as a 1 for “the document contains this term” and a 0 for “the document does not contain this term”, or the frequency with which the term appears.
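The steps above can be sketched in plain Python. This is a minimal illustration using the three toy documents, counting raw term frequencies over unigrams and 2-grams; the function and variable names are my own, not from the original analysis.

```python
from collections import Counter

documents = [
    "victor likes apples and bananas",
    "my dog hates apples and bananas",
    "my dog is an animal but bananas are not",
]

def terms_of(doc, n_values=(1, 2)):
    """Split a document into unigrams and 2-grams."""
    words = doc.split()
    terms = []
    for n in n_values:
        terms += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return terms

# Vocabulary: every term seen in any document, in a stable (sorted) order.
vocabulary = sorted({t for doc in documents for t in terms_of(doc)})

# Term-document matrix: one row per term, one column per document,
# entries are raw counts of the term in the document.
counts = [Counter(terms_of(doc)) for doc in documents]
matrix = [[counts[j][term] for j in range(len(documents))] for term in vocabulary]
```

Note how the 2-grams (“apples and”, “victor likes”, …) simply become extra rows alongside the single-word terms.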
On Tf-Idf and the term-document matrix
In this case I used Term Frequency - Inverse Document Frequency (Tf-Idf) weights in the term-document matrix. A Tf-Idf weight tells me how important a term is in a document. It basically rewards a term if it shows up a lot in a document, but penalizes it if it shows up in a large number of documents. For instance, if all your documents are about animals, the word “animal” isn’t useful for understanding the minutiae of a specific text, but the word “zebra” might be good for grasping what a specific text is about.
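The reward/penalty trade-off can be made concrete with one common tf-idf variant (raw term frequency times log of inverse document frequency). The post doesn’t specify which exact formula was used, so this is a sketch of the classic one, on the same toy documents:

```python
import math

documents = [
    "victor likes apples and bananas",
    "my dog hates apples and bananas",
    "my dog is an animal but bananas are not",
]
tokenized = [doc.split() for doc in documents]
n_docs = len(documents)

def tf_idf(term, words):
    """Classic tf-idf: frequency of the term in the document times log(N / df),
    where df is the number of documents containing the term."""
    tf = words.count(term) / len(words)
    df = sum(1 for doc_words in tokenized if term in doc_words)
    return tf * math.log(n_docs / df)

# "bananas" appears in all three documents, so its weight collapses to zero;
# "victor" appears only in the first document, so it gets a positive weight there.
weight_bananas = tf_idf("bananas", tokenized[0])
weight_victor = tf_idf("victor", tokenized[0])
```

This is exactly the “zebra vs. animal” effect: the ubiquitous term is driven to zero while the distinctive one stands out.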
On Singular Value Decomposition
Now that I have a measure of each document’s position in the semantic space, I can reduce the dimensionality using a linear algebra technique called Singular Value Decomposition (SVD). It allows me to reduce the number of rows of a sparse matrix (such as the one I get by vectorizing my documents into a Tf-Idf term-document matrix) while losing as little information as possible about the relations among the columns. By doing that, I’m able to visualize how my documents are distributed in space. In this example I chose to keep only 3 dimensions and use color to show clustering information.
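A minimal NumPy sketch of the truncated-SVD step, using a random stand-in for the real tf-idf matrix (the shapes, not the values, are the point here):

```python
import numpy as np

# Stand-in tf-idf term-document matrix: 14 terms (rows) x 6 documents (columns);
# random values just to illustrate the shapes involved.
rng = np.random.default_rng(0)
A = rng.random((14, 6))

# Full SVD, then keep only the k largest singular values (truncated SVD):
# the rank-k product U_k @ diag(s_k) @ Vt_k is the closest rank-k
# approximation of A, i.e. the reduction that loses the least information.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 3
doc_coords = (np.diag(s[:k]) @ Vt[:k]).T  # one 3-D point per document
```

Each row of `doc_coords` is a document’s position in the reduced 3-dimensional semantic space, ready to be scattered on a 3-D plot.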
The next graphs show that there isn’t a clear pattern relating the content of the propositions to the political parties.
But it’s still possible to see how K-Means clustering can be applied to this problem.
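For reference, here is a bare-bones K-Means (Lloyd’s algorithm) over 3-D document coordinates like the ones produced by the SVD step. The points are random stand-ins; in practice a library implementation such as scikit-learn’s `KMeans` would be the usual choice.

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Distance of every point to every centroid: shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# 3-D document coordinates, as produced by the SVD step (random stand-ins here).
rng = np.random.default_rng(1)
points = rng.random((20, 3))
labels, centroids = kmeans(points, k=4)
```

The resulting `labels` array gives one cluster id per document, which is what drives the coloring in the graphs above.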
You can check this and other implementations in my COVID-19 GitHub repository.